Introduction
This project focuses on an analysis of Fortune 1000 companies, in particular on the Fortune 1000 list of 2022.
The Fortune 1000 are the 1000 largest American companies ranked by revenue for a given year, as compiled by Fortune magazine. The list only includes companies that are incorporated or authorized to do business in the United States and for which revenues are publicly available. The companies come from many different sectors and vary widely in size, from Big Tech giants such as Apple and Amazon to much smaller firms.
While this list includes many of the most prominent and successful companies in the country, it is not a flawless measure of success. Only companies whose revenues are publicly available are considered, which excludes many privately held businesses. Additionally, the list relies on revenue rather than profitability as its criterion for inclusion and ranking.
Research Questions
I will focus on several topics and factors related to Fortune 1000 companies, attempting to answer the following research questions.
Financial performances and sectors analysis:
Focus on the geographical dimension:
Company policies:
Datasets choice
To answer the different research questions, I have selected various datasets from different sources.
Fortune 2022 Dataset is the main dataset for this project. I downloaded it from Kaggle (https://www.kaggle.com/datasets/winston56/fortune-500-data-2021). The source seems reliable because the dataset was created by scraping the official Fortune website.
Fortune 2022 Second Dataset is a secondary dataset with some additional columns on the Fortune 1000 list that can be added to the main dataset. The source is again Kaggle (https://www.kaggle.com/datasets/ramjasmaurya/fortune-1000-companieslatest), and this dataset too was created by scraping the Fortune website.
US GDP by State Dataset contains GDP (Gross Domestic Product) data for each state of the United States in 2021. The source is the BEA (Bureau of Economic Analysis), an official United States government source (https://shorturl.at/bzHNX).
US States Abbreviations Dataset includes the abbreviation of each state of the United States. Source (Kaggle): https://www.kaggle.com/datasets/giodev11/usstates-dataset?select=state-abbrevs.csv.
US Unemployment Rates by State Dataset contains monthly unemployment rates by state from 2017 to 2022. Source (Kaggle): https://www.kaggle.com/datasets/pasicebear/us-unemployment-rates-per-state-20172021?select=unemployment_rates.csv.
US Population by State Dataset contains US population data by state from 2020 to 2022. The source is the United States Census Bureau: https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html.
First of all, I import all the libraries needed for the activities I will carry out to complete the project.
%pip install numpy pandas matplotlib plotly seaborn scipy geopandas openpyxl folium branca
import os
import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import copy as cp
import scipy.stats as stats
import geopandas as gpd
import folium
from branca.colormap import LinearColormap
Defaulting to user installation because normal site-packages is not writeable
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2
Note: you may need to restart the kernel to use updated packages.
I read the main dataset that I will use for the project. It contains the list of Fortune 1000 companies along with many of their features. Revenues, profits, and market caps are in millions of U.S. dollars. The .info() method shows the dataset's columns, data types, and missing-value counts.
current_directory = os.getcwd()
csv_file_path = 'Fortune_1000.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)
Fortune_1000 = pd.read_csv(absolute_file_path, index_col='rank', encoding='utf-8')
Fortune_1000.info()
Fortune_1000
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   company            1000 non-null   object
 1   rank_change        1000 non-null   float64
 2   revenue            1000 non-null   float64
 3   profit             997 non-null    float64
 4   num. of employees  999 non-null    float64
 5   sector             1000 non-null   object
 6   city               1000 non-null   object
 7   state              1000 non-null   object
 8   newcomer           1000 non-null   object
 9   ceo_founder        1000 non-null   object
 10  ceo_woman          1000 non-null   object
 11  profitable         1000 non-null   object
 12  prev_rank          1000 non-null   object
 13  CEO                1000 non-null   object
 14  Website            1000 non-null   object
 15  Ticker             951 non-null    object
 16  Market Cap         969 non-null    object
dtypes: float64(4), object(13)
memory usage: 140.6+ KB
| company | rank_change | revenue | profit | num. of employees | sector | city | state | newcomer | ceo_founder | ceo_woman | profitable | prev_rank | CEO | Website | Ticker | Market Cap | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | |||||||||||||||||
| 1 | Walmart | 0.0 | 572754.0 | 13673.0 | 2300000.0 | Retailing | Bentonville | AR | no | no | no | yes | 1.0 | C. Douglas McMillon | https://www.stock.walmart.com | WMT | 352037 |
| 2 | Amazon | 0.0 | 469822.0 | 33364.0 | 1608000.0 | Retailing | Seattle | WA | no | no | no | yes | 2.0 | Andrew R. Jassy | www.amazon.com | AMZN | 1202717 |
| 3 | Apple | 0.0 | 365817.0 | 94680.0 | 154000.0 | Technology | Cupertino | CA | no | no | no | yes | 3.0 | Timothy D. Cook | www.apple.com | AAPL | 2443962 |
| 4 | CVS Health | 0.0 | 292111.0 | 7910.0 | 258000.0 | Health Care | Woonsocket | RI | no | no | yes | yes | 4.0 | Karen Lynch | https://www.cvshealth.com | CVS | 125204 |
| 5 | UnitedHealth Group | 0.0 | 287597.0 | 17285.0 | 350000.0 | Health Care | Minnetonka | MN | no | no | no | yes | 5.0 | Andrew P. Witty | www.unitedhealthgroup.com | UNH | 500468 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | Vizio Holding | 0.0 | 2124.0 | -39.4 | 800.0 | Industrials | Irvine | CA | no | yes | no | no | William W. Wang | https://www.vizio.com | VZIO | 1705.1 | |
| 997 | 1-800-Flowers.com | 0.0 | 2122.2 | 118.7 | 4800.0 | Retailing | Jericho | NY | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com | FLWS | 830 | |
| 998 | Cowen | 0.0 | 2112.8 | 295.6 | 1534.0 | Financials | New York | NY | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com | COWN | 1078 | |
| 999 | Ashland | 0.0 | 2111.0 | 220.0 | 4100.0 | Chemicals | Wilmington | DE | no | no | no | yes | Guillermo Novo | https://www.ashland.com | ASH | 5601.9 | |
| 1000 | DocuSign | 0.0 | 2107.2 | -70.0 | 7461.0 | Technology | San Francisco | CA | no | no | no | no | Allan C. Thygesen | https://www.docusign.com | DOCU | 21302.8 |
1000 rows × 17 columns
As can be seen, there are some missing values, so I search for NaN (Not a Number) values that could interfere with the data analysis process. I use the isna() and sum() methods to summarize the number of NaN values in each column.
Fortune_1000.isna().sum()
company               0
rank_change           0
revenue               0
profit                3
num. of employees     1
sector                0
city                  0
state                 0
newcomer              0
ceo_founder           0
ceo_woman             0
profitable            0
prev_rank             0
CEO                   0
Website               0
Ticker               49
Market Cap           31
dtype: int64
I identify and print the rows in the Fortune_1000 data frame that have missing values.
Fortune_1000.loc[Fortune_1000.isna().any(axis=1), Fortune_1000.isna().any()]
| profit | num. of employees | Ticker | Market Cap | |
|---|---|---|---|---|
| rank | ||||
| 42 | 1280.9 | 53586.0 | NaN | NaN |
| 72 | 277.1 | 14344.0 | NaN | - |
| 76 | 4412.2 | 232000.0 | NaN | - |
| 78 | 3068.0 | 45000.0 | NaN | NaN |
| 80 | 1617.2 | 24134.0 | NaN | NaN |
| 90 | 4060.7 | 15065.0 | NaN | - |
| 95 | 554.0 | 9941.0 | NaN | NaN |
| 96 | 3300.0 | 37335.0 | NaN | - |
| 97 | 977.8 | 7585.0 | NaN | - |
| 100 | 319.3 | 10052.0 | NaN | - |
| 232 | 287.4 | 9000.0 | NaN | NaN |
| 247 | 288.8 | 9150.0 | NaN | NaN |
| 251 | 734.5 | 13590.0 | NaN | NaN |
| 264 | -366.8 | 11787.0 | NaN | NaN |
| 266 | 1102.0 | 3861.0 | NaN | NaN |
| 303 | 1605.0 | 50000.0 | NaN | - |
| 313 | 753.0 | 28800.0 | NaN | NaN |
| 324 | 579.0 | 5739.0 | NaN | NaN |
| 339 | 1318.4 | 6441.0 | NaN | NaN |
| 351 | 2530.2 | 3256.0 | NaN | NaN |
| 368 | 1130.1 | 5805.0 | NaN | NaN |
| 372 | 361.2 | 2538.0 | NaN | NaN |
| 399 | 262.4 | 8800.0 | NaN | - |
| 418 | 2842.4 | 5560.0 | NaN | NaN |
| 464 | 311.5 | 6941.0 | NaN | NaN |
| 499 | NaN | 15640.0 | FYBR | 6763.1 |
| 535 | NaN | 1300.0 | CHK | 11135.9 |
| 539 | 555.6 | 25150.0 | NaN | 27492.2 |
| 559 | 665.9 | 21700.0 | NaN | 3088.7 |
| 597 | -221.4 | 7500.0 | NaN | 30976.8 |
| 602 | 622.3 | 4100.0 | NaN | NaN |
| 623 | 455.1 | 17662.0 | NaN | 4185.4 |
| 647 | 522.9 | 3194.0 | NaN | NaN |
| 650 | 739.9 | 4064.0 | NaN | NaN |
| 659 | -165.1 | 8900.0 | NaN | NaN |
| 660 | 699.3 | 4545.0 | NaN | 5084.1 |
| 680 | -27.0 | 6100.0 | NaN | 1856.2 |
| 690 | 419.2 | 3149.0 | NaN | NaN |
| 710 | 199.0 | 9600.0 | NaN | 2519.4 |
| 714 | 260.8 | 3185.0 | NaN | NaN |
| 736 | NaN | 7600.0 | GTX | 463.9 |
| 748 | 78.3 | 2834.0 | NaN | NaN |
| 775 | 392.1 | 1157.0 | NaN | NaN |
| 789 | 213.8 | 1196.0 | NaN | 3056.2 |
| 815 | -112.2 | 15000.0 | NaN | 6799.3 |
| 866 | 133.7 | NaN | NaN | NaN |
| 874 | -104.2 | 2952.0 | NaN | 10463.5 |
| 878 | 50.7 | 1146.0 | NaN | NaN |
| 880 | 117.1 | 3510.0 | NaN | NaN |
| 907 | 320.1 | 2702.0 | NaN | NaN |
| 925 | 130.3 | 749.0 | NaN | NaN |
| 948 | -85.2 | 1082.0 | NaN | NaN |
Since I won't use the Ticker values, which are just the symbols used to identify publicly traded companies, and that column has 49 NaN values, I remove it from the dataframe. I do the same with the Market Cap column, because it has many NaN values and I won't use it in the analysis. I also remove the rank_change and prev_rank columns because they relate to the 2021 list, so some values (such as the 2021 rank of a company that was not on that list) might be misleading.
Fortune_1000 = Fortune_1000.drop(columns=["Ticker", "Market Cap", "rank_change", "prev_rank"])
Fortune_1000
| company | revenue | profit | num. of employees | sector | city | state | newcomer | ceo_founder | ceo_woman | profitable | CEO | Website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | |||||||||||||
| 1 | Walmart | 572754.0 | 13673.0 | 2300000.0 | Retailing | Bentonville | AR | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com |
| 2 | Amazon | 469822.0 | 33364.0 | 1608000.0 | Retailing | Seattle | WA | no | no | no | yes | Andrew R. Jassy | www.amazon.com |
| 3 | Apple | 365817.0 | 94680.0 | 154000.0 | Technology | Cupertino | CA | no | no | no | yes | Timothy D. Cook | www.apple.com |
| 4 | CVS Health | 292111.0 | 7910.0 | 258000.0 | Health Care | Woonsocket | RI | no | no | yes | yes | Karen Lynch | https://www.cvshealth.com |
| 5 | UnitedHealth Group | 287597.0 | 17285.0 | 350000.0 | Health Care | Minnetonka | MN | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | Vizio Holding | 2124.0 | -39.4 | 800.0 | Industrials | Irvine | CA | no | yes | no | no | William W. Wang | https://www.vizio.com |
| 997 | 1-800-Flowers.com | 2122.2 | 118.7 | 4800.0 | Retailing | Jericho | NY | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com |
| 998 | Cowen | 2112.8 | 295.6 | 1534.0 | Financials | New York | NY | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com |
| 999 | Ashland | 2111.0 | 220.0 | 4100.0 | Chemicals | Wilmington | DE | no | no | no | yes | Guillermo Novo | https://www.ashland.com |
| 1000 | DocuSign | 2107.2 | -70.0 | 7461.0 | Technology | San Francisco | CA | no | no | no | no | Allan C. Thygesen | https://www.docusign.com |
1000 rows × 13 columns
I search for NaN values in the profit and num. of employees columns so I can fill them manually, since there are only 4 missing values and they can easily be found online.
Fortune_1000[Fortune_1000["profit"].isna()]
| company | revenue | profit | num. of employees | sector | city | state | newcomer | ceo_founder | ceo_woman | profitable | CEO | Website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | |||||||||||||
| 499 | Frontier Communications | 6411.0 | NaN | 15640.0 | Telecommunications | Norwalk | CT | no | no | no | no | Nick Jeffery | https://www.frontier.com |
| 535 | Chesapeake Energy | 5809.0 | NaN | 1300.0 | Energy | Oklahoma City | OK | no | no | no | no | Domenic J. Dell'Osso Jr. | https://www.chk.com |
| 736 | Garrett Motion | 3633.0 | NaN | 7600.0 | Motor Vehicles & Parts | Plymouth | MI | no | no | no | no | Olivier Rabiller | https://www.garrettmotion.com |
Fortune_1000[Fortune_1000["num. of employees"].isna()]
| company | revenue | profit | num. of employees | sector | city | state | newcomer | ceo_founder | ceo_woman | profitable | CEO | Website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | |||||||||||||
| 866 | Univision Communciations | 2841.0 | 133.7 | NaN | Media | New York | NY | no | no | no | yes | Wade C. Davis | https://https://corporate.televisaunivision.com |
nan_indices = Fortune_1000.index[Fortune_1000["profit"].isna()].tolist()
# Profits found online, in the order the NaN rows appear above
# (Frontier Communications, Chesapeake Energy, Garrett Motion).
fill_values = [441.0, 945.0, 398.0]
Fortune_1000.loc[nan_indices, "profit"] = fill_values
# Employee count for Univision Communications, the only missing value.
Fortune_1000["num. of employees"].fillna(4500.0, inplace=True)
Fortune_1000["num. of employees"] = Fortune_1000["num. of employees"].astype(int)
Fortune_1000.isna().sum()
company              0
revenue              0
profit               0
num. of employees    0
sector               0
city                 0
state                0
newcomer             0
ceo_founder          0
ceo_woman            0
profitable           0
CEO                  0
Website              0
dtype: int64
Now the dataset has no NaN values...
Fortune_1000.dtypes
company               object
revenue              float64
profit               float64
num. of employees      int64
sector                object
city                  object
state                 object
newcomer              object
ceo_founder           object
ceo_woman             object
profitable            object
CEO                   object
Website               object
dtype: object
...and the data types are the desired ones. The dataset is now better suited to the scope of the analysis, but I also rename the columns to make everything clearer and more readable.
Fortune_1000 = Fortune_1000.rename(columns={"company": "Company", "revenue": "Revenue", "profit": "Profit", "num. of employees": "Employees", "sector": "Sector", "city": "City", "state": "State", "newcomer": "New_Entry", "ceo_founder": "CEO_Founder", "ceo_woman": "CEO_Woman", "profitable": "Profitable"})
Fortune_1000
| Company | Revenue | Profit | Employees | Sector | City | State | New_Entry | CEO_Founder | CEO_Woman | Profitable | CEO | Website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | |||||||||||||
| 1 | Walmart | 572754.0 | 13673.0 | 2300000 | Retailing | Bentonville | AR | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com |
| 2 | Amazon | 469822.0 | 33364.0 | 1608000 | Retailing | Seattle | WA | no | no | no | yes | Andrew R. Jassy | www.amazon.com |
| 3 | Apple | 365817.0 | 94680.0 | 154000 | Technology | Cupertino | CA | no | no | no | yes | Timothy D. Cook | www.apple.com |
| 4 | CVS Health | 292111.0 | 7910.0 | 258000 | Health Care | Woonsocket | RI | no | no | yes | yes | Karen Lynch | https://www.cvshealth.com |
| 5 | UnitedHealth Group | 287597.0 | 17285.0 | 350000 | Health Care | Minnetonka | MN | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | Vizio Holding | 2124.0 | -39.4 | 800 | Industrials | Irvine | CA | no | yes | no | no | William W. Wang | https://www.vizio.com |
| 997 | 1-800-Flowers.com | 2122.2 | 118.7 | 4800 | Retailing | Jericho | NY | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com |
| 998 | Cowen | 2112.8 | 295.6 | 1534 | Financials | New York | NY | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com |
| 999 | Ashland | 2111.0 | 220.0 | 4100 | Chemicals | Wilmington | DE | no | no | no | yes | Guillermo Novo | https://www.ashland.com |
| 1000 | DocuSign | 2107.2 | -70.0 | 7461 | Technology | San Francisco | CA | no | no | no | no | Allan C. Thygesen | https://www.docusign.com |
1000 rows × 13 columns
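As a preview of the sector-level analysis, aggregations over the cleaned columns are now straightforward. A minimal sketch, using a hypothetical two-row frame in the same shape as the cleaned Fortune_1000 (the margin metric here is illustrative, not part of the original analysis):

```python
import pandas as pd

# Hypothetical mini-frame in the same shape as the cleaned Fortune_1000.
df = pd.DataFrame({
    "Company": ["Walmart", "Apple"],
    "Revenue": [572754.0, 365817.0],
    "Profit": [13673.0, 94680.0],
    "Sector": ["Retailing", "Technology"],
})

# Profit margin per company, then the average margin per sector.
df["Margin"] = df["Profit"] / df["Revenue"]
sector_margin = df.groupby("Sector")["Margin"].mean()
```

The same groupby pattern works for any of the renamed columns (e.g. average Employees per Sector).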
I read another dataset containing the Fortune 1000 2022 companies. I will use it only to import the assets column, which is useful for my analysis. As in the previous dataset, revenues, profits, assets, and market values are in millions of U.S. dollars.
csv_file_path = 'fortune 1000 companies in 2022 - 2022.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)
Fortune_1000_sec = pd.read_csv(absolute_file_path, index_col = 'rank in 2022', encoding='utf-8')
Fortune_1000_sec.info()
Fortune_1000_sec
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1,000
Data columns (total 10 columns):
 #  Column                                Non-Null Count  Dtype
--- ------                                --------------  -----
 0  Name                                  1000 non-null   object
 1  Revenue                               1000 non-null   object
 2  revenue(% change)                     1000 non-null   object
 3  profits in millions                   1000 non-null   object
 4  profits % change                      1000 non-null   object
 5  assets                                1000 non-null   object
 6  market value                          1000 non-null   object
 7  change in rank of top 1000 companies  1000 non-null   object
 8  employees                             1000 non-null   object
 9  change in rank(500 only)              1000 non-null   object
dtypes: object(10)
memory usage: 85.9+ KB
| Name | Revenue | revenue(% change) | profits in millions | profits % change | assets | market value | change in rank of top 1000 companies | employees | change in rank(500 only) | |
|---|---|---|---|---|---|---|---|---|---|---|
| rank in 2022 | ||||||||||
| 1 | Walmart | $572,754 | 2.40% | $13,673 | 1.20% | $244,860 | $409,795 | - | 2,300,000 | - |
| 2 | Amazon | $469,822 | 21.70% | $33,364 | 56.40% | $420,549 | $1,658,807.30 | - | 1,608,000 | - |
| 3 | Apple | $365,817 | 33.30% | $94,680 | 64.90% | $351,002 | $2,849,537.60 | - | 154,000 | - |
| 4 | CVS Health | $292,111 | 8.70% | $7,910 | 10.20% | $232,999 | $132,839.20 | - | 258,000 | - |
| 5 | UnitedHealth Group | $287,597 | 11.80% | $17,285 | 12.20% | $212,206 | $479,830.30 | - | 350,000 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | Vizio Holding | $2,124 | 4% | -$39.40 | -138.40% | $935.80 | $1,705.10 | - | 800 | - |
| 997 | 1-800-Flowers.com | $2,122.20 | 42.50% | $118.70 | 101.10% | $1,076.70 | $830 | - | 4,800 | - |
| 998 | Cowen | $2,112.80 | 30.20% | $295.60 | 36.60% | $8,748.80 | $744.10 | - | 1,534 | - |
| 999 | Ashland Global Holdings | $2,111 | -11.20% | $220 | - | $6,612 | $5,601.90 | -130 | 4,100 | - |
| 1,000 | DocuSign | $2,107.20 | 45% | -$70 | - | $2,541.30 | $21,302.80 | - | 7,461 | - |
1000 rows × 10 columns
I search for NaN values and, apparently, there are none.
Fortune_1000_sec.isna().sum()
Name                                    0
Revenue                                 0
revenue(% change)                       0
profits in millions                     0
profits % change                        0
assets                                  0
market value                            0
change in rank of top 1000 companies    0
employees                               0
change in rank(500 only)                0
dtype: int64
However, missing values in this dataset are not represented as NaN but as "-". If I count the occurrences of "-" in the data frame, I can see that there actually are many missing values.
# Note: this checks for "-" as a substring, so it also matches hyphens
# inside other values (e.g. negative numbers); the count is an upper bound.
count_missing_values = (Fortune_1000_sec.applymap(lambda x: '-' in str(x))).sum().sum()
print(count_missing_values)
2193
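Such "-" placeholders could also be normalized to real NaN values up front, so that isna() reports them as missing. A minimal sketch on a hypothetical mini-frame (not the actual dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mimicking the second dataset, where a
# missing value is stored as a literal "-" string.
df = pd.DataFrame({"assets": ["$244,860", "$420,549"],
                   "market value": ["$409,795", "-"]})

# replace() matches whole cells, so "-" placeholders become NaN while
# hyphens inside other values (e.g. negatives like "-$39.40") survive.
df_clean = df.replace("-", np.nan)
```

After this, `df_clean.isna().sum()` counts the former "-" cells directly, without the substring-matching caveat above.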
That is quite a lot of missing values. Since I'm only interested in the assets column, I search for missing values in that column.
count_assets = Fortune_1000_sec["assets"].str.count("-").sum()
print(count_assets)
0
There are no missing values in this column. I check the data type of the assets column.
assets_datatype = Fortune_1000_sec["assets"].dtypes
print(assets_datatype)
object
The data type is object, so I want to convert it to float. However, the values contain symbols ("$" and ",") that prevent a direct conversion, so I remove those symbols first and then convert the values to float.
Fortune_1000_sec["assets"] = Fortune_1000_sec["assets"].str.replace("$", "", regex=False)
Fortune_1000_sec["assets"] = Fortune_1000_sec["assets"].str.replace(",", "", regex=False)
Fortune_1000_sec["assets"] = Fortune_1000_sec["assets"].astype(float)
print(Fortune_1000_sec["assets"])
rank in 2022
1 244860.0
2 420549.0
3 351002.0
4 232999.0
5 212206.0
...
996 935.8
997 1076.7
998 8748.8
999 6612.0
1,000 2541.3
Name: assets, Length: 1000, dtype: float64
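With the assets column converted to float, it can be attached to the main dataframe. A minimal sketch, assuming both frames follow the same 1-to-1000 rank order of the 2022 list so rows can be aligned positionally (the mini-frames below are hypothetical):

```python
import pandas as pd

# Hypothetical two-row miniatures of the two frames; the real ones
# both follow the 2022 rank order.
main = pd.DataFrame({"Company": ["Walmart", "Amazon"]}, index=[1, 2])
secondary = pd.DataFrame({"assets": [244860.0, 420549.0]})

# Align by position rather than by index label, since the two frames
# use differently formatted rank indices (e.g. 1000 vs "1,000").
main["Assets"] = secondary["assets"].to_numpy()
```

Using `.to_numpy()` sidesteps pandas' index alignment, which would otherwise produce NaN wherever the two rank indices don't match exactly.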
I read a dataset containing the 2021 GDP of each U.S. state in millions of U.S. dollars.
csv_file_path = 'US GDP by State 2021.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)
US_GDP_State = pd.read_csv(absolute_file_path,encoding='utf-8',skiprows=3,skipfooter=13,engine='python')
US_GDP_State.info()
US_GDP_State
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   GeoFips  52 non-null     int64
 1   GeoName  52 non-null     object
 2   2021     52 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.3+ KB
| GeoFips | GeoName | 2021 | |
|---|---|---|---|
| 0 | 0 | United States * | 23315081.0 |
| 1 | 1000 | Alabama | 254109.7 |
| 2 | 2000 | Alaska | 57349.4 |
| 3 | 4000 | Arizona | 420026.7 |
| 4 | 5000 | Arkansas | 148676.1 |
| 5 | 6000 | California | 3373240.7 |
| 6 | 8000 | Colorado | 436359.5 |
| 7 | 9000 | Connecticut | 298395.2 |
| 8 | 10000 | Delaware | 81160.0 |
| 9 | 11000 | District of Columbia | 153670.5 |
| 10 | 12000 | Florida | 1255558.3 |
| 11 | 13000 | Georgia | 691626.9 |
| 12 | 15000 | Hawaii | 91096.1 |
| 13 | 16000 | Idaho | 96282.8 |
| 14 | 17000 | Illinois | 945673.8 |
| 15 | 18000 | Indiana | 412975.2 |
| 16 | 19000 | Iowa | 216860.2 |
| 17 | 20000 | Kansas | 191380.6 |
| 18 | 21000 | Kentucky | 237182.0 |
| 19 | 22000 | Louisiana | 258571.3 |
| 20 | 23000 | Maine | 77963.3 |
| 21 | 24000 | Maryland | 443929.9 |
| 22 | 25000 | Massachusetts | 641332.2 |
| 23 | 26000 | Michigan | 572205.8 |
| 24 | 27000 | Minnesota | 412458.6 |
| 25 | 28000 | Mississippi | 127307.7 |
| 26 | 29000 | Missouri | 358572.0 |
| 27 | 30000 | Montana | 58699.8 |
| 28 | 31000 | Nebraska | 146285.4 |
| 29 | 32000 | Nevada | 194486.6 |
| 30 | 33000 | New Hampshire | 99673.3 |
| 31 | 34000 | New Jersey | 682945.9 |
| 32 | 35000 | New Mexico | 109582.8 |
| 33 | 36000 | New York | 1901296.5 |
| 34 | 37000 | North Carolina | 662120.8 |
| 35 | 38000 | North Dakota | 63559.6 |
| 36 | 39000 | Ohio | 756617.2 |
| 37 | 40000 | Oklahoma | 215336.3 |
| 38 | 41000 | Oregon | 272190.9 |
| 39 | 42000 | Pennsylvania | 844496.5 |
| 40 | 44000 | Rhode Island | 66570.9 |
| 41 | 45000 | South Carolina | 269802.5 |
| 42 | 46000 | South Dakota | 61684.7 |
| 43 | 47000 | Tennessee | 427125.5 |
| 44 | 48000 | Texas | 2051768.6 |
| 45 | 49000 | Utah | 225340.3 |
| 46 | 50000 | Vermont | 37103.8 |
| 47 | 51000 | Virginia | 604957.6 |
| 48 | 53000 | Washington | 677489.5 |
| 49 | 54000 | West Virginia | 85434.2 |
| 50 | 55000 | Wisconsin | 368611.3 |
| 51 | 56000 | Wyoming | 41510.2 |
US_GDP_State.isna().sum()
GeoFips 0 GeoName 0 2021 0 dtype: int64
There are no missing values. Now I remove the first row, which refers to the United States as a whole, and the GeoFips column, which is not useful.
US_GDP_State = US_GDP_State.drop(US_GDP_State.loc[US_GDP_State["GeoName"]=="United States *"].index)
US_GDP_State = US_GDP_State.drop("GeoFips", axis=1)
US_GDP_State.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 1 to 51
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   GeoName  51 non-null     object
 1   2021     51 non-null     float64
dtypes: float64(1), object(1)
memory usage: 948.0+ bytes
The dataframe contains 51 rows (50 states plus the District of Columbia). To decide whether to keep the District of Columbia in the list of states, I check whether the main dataset (Fortune_1000) contains companies from this district.
(Fortune_1000['State'] == 'DC').sum()
5
There are indeed 5 such companies, so I decide to keep DC in the list of US states. I also rename the dataframe columns for clarity and reset the index to start from 0.
US_GDP_State = US_GDP_State.rename(columns={'GeoName': 'State', '2021': 'GDP_2021'})
US_GDP_State.reset_index(drop=True, inplace=True)
US_GDP_State
| State | GDP_2021 | |
|---|---|---|
| 0 | Alabama | 254109.7 |
| 1 | Alaska | 57349.4 |
| 2 | Arizona | 420026.7 |
| 3 | Arkansas | 148676.1 |
| 4 | California | 3373240.7 |
| 5 | Colorado | 436359.5 |
| 6 | Connecticut | 298395.2 |
| 7 | Delaware | 81160.0 |
| 8 | District of Columbia | 153670.5 |
| 9 | Florida | 1255558.3 |
| 10 | Georgia | 691626.9 |
| 11 | Hawaii | 91096.1 |
| 12 | Idaho | 96282.8 |
| 13 | Illinois | 945673.8 |
| 14 | Indiana | 412975.2 |
| 15 | Iowa | 216860.2 |
| 16 | Kansas | 191380.6 |
| 17 | Kentucky | 237182.0 |
| 18 | Louisiana | 258571.3 |
| 19 | Maine | 77963.3 |
| 20 | Maryland | 443929.9 |
| 21 | Massachusetts | 641332.2 |
| 22 | Michigan | 572205.8 |
| 23 | Minnesota | 412458.6 |
| 24 | Mississippi | 127307.7 |
| 25 | Missouri | 358572.0 |
| 26 | Montana | 58699.8 |
| 27 | Nebraska | 146285.4 |
| 28 | Nevada | 194486.6 |
| 29 | New Hampshire | 99673.3 |
| 30 | New Jersey | 682945.9 |
| 31 | New Mexico | 109582.8 |
| 32 | New York | 1901296.5 |
| 33 | North Carolina | 662120.8 |
| 34 | North Dakota | 63559.6 |
| 35 | Ohio | 756617.2 |
| 36 | Oklahoma | 215336.3 |
| 37 | Oregon | 272190.9 |
| 38 | Pennsylvania | 844496.5 |
| 39 | Rhode Island | 66570.9 |
| 40 | South Carolina | 269802.5 |
| 41 | South Dakota | 61684.7 |
| 42 | Tennessee | 427125.5 |
| 43 | Texas | 2051768.6 |
| 44 | Utah | 225340.3 |
| 45 | Vermont | 37103.8 |
| 46 | Virginia | 604957.6 |
| 47 | Washington | 677489.5 |
| 48 | West Virginia | 85434.2 |
| 49 | Wisconsin | 368611.3 |
| 50 | Wyoming | 41510.2 |
The following dataset contains the abbreviation of each U.S. state.
csv_file_path = 'US States abbrevs.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)
US_States = pd.read_csv(absolute_file_path,encoding='utf-8')
US_States.info()
US_States
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   state         51 non-null     object
 1   abbreviation  51 non-null     object
dtypes: object(2)
memory usage: 944.0+ bytes
| state | abbreviation | |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
| 5 | Colorado | CO |
| 6 | Connecticut | CT |
| 7 | Delaware | DE |
| 8 | District of Columbia | DC |
| 9 | Florida | FL |
| 10 | Georgia | GA |
| 11 | Hawaii | HI |
| 12 | Idaho | ID |
| 13 | Illinois | IL |
| 14 | Indiana | IN |
| 15 | Iowa | IA |
| 16 | Kansas | KS |
| 17 | Kentucky | KY |
| 18 | Louisiana | LA |
| 19 | Maine | ME |
| 20 | Montana | MT |
| 21 | Nebraska | NE |
| 22 | Nevada | NV |
| 23 | New Hampshire | NH |
| 24 | New Jersey | NJ |
| 25 | New Mexico | NM |
| 26 | New York | NY |
| 27 | North Carolina | NC |
| 28 | North Dakota | ND |
| 29 | Ohio | OH |
| 30 | Oklahoma | OK |
| 31 | Oregon | OR |
| 32 | Maryland | MD |
| 33 | Massachusetts | MA |
| 34 | Michigan | MI |
| 35 | Minnesota | MN |
| 36 | Mississippi | MS |
| 37 | Missouri | MO |
| 38 | Pennsylvania | PA |
| 39 | Rhode Island | RI |
| 40 | South Carolina | SC |
| 41 | South Dakota | SD |
| 42 | Tennessee | TN |
| 43 | Texas | TX |
| 44 | Utah | UT |
| 45 | Vermont | VT |
| 46 | Virginia | VA |
| 47 | Washington | WA |
| 48 | West Virginia | WV |
| 49 | Wisconsin | WI |
| 50 | Wyoming | WY |
As always, I check for missing values and rename the columns.
US_States.isna().sum()
state 0 abbreviation 0 dtype: int64
US_States = US_States.rename(columns={'state': 'State', 'abbreviation': 'Abbreviation'})
US_States
| State | Abbreviation | |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
| 5 | Colorado | CO |
| 6 | Connecticut | CT |
| 7 | Delaware | DE |
| 8 | District of Columbia | DC |
| 9 | Florida | FL |
| 10 | Georgia | GA |
| 11 | Hawaii | HI |
| 12 | Idaho | ID |
| 13 | Illinois | IL |
| 14 | Indiana | IN |
| 15 | Iowa | IA |
| 16 | Kansas | KS |
| 17 | Kentucky | KY |
| 18 | Louisiana | LA |
| 19 | Maine | ME |
| 20 | Montana | MT |
| 21 | Nebraska | NE |
| 22 | Nevada | NV |
| 23 | New Hampshire | NH |
| 24 | New Jersey | NJ |
| 25 | New Mexico | NM |
| 26 | New York | NY |
| 27 | North Carolina | NC |
| 28 | North Dakota | ND |
| 29 | Ohio | OH |
| 30 | Oklahoma | OK |
| 31 | Oregon | OR |
| 32 | Maryland | MD |
| 33 | Massachusetts | MA |
| 34 | Michigan | MI |
| 35 | Minnesota | MN |
| 36 | Mississippi | MS |
| 37 | Missouri | MO |
| 38 | Pennsylvania | PA |
| 39 | Rhode Island | RI |
| 40 | South Carolina | SC |
| 41 | South Dakota | SD |
| 42 | Tennessee | TN |
| 43 | Texas | TX |
| 44 | Utah | UT |
| 45 | Vermont | VT |
| 46 | Virginia | VA |
| 47 | Washington | WA |
| 48 | West Virginia | WV |
| 49 | Wisconsin | WI |
| 50 | Wyoming | WY |
I read the dataset containing unemployment rates by state from 2017 to 2022.
csv_file_path = 'unemployment_rates.csv'
absolute_file_path = os.path.join(current_directory, csv_file_path)
US_Unemployment = pd.read_csv(absolute_file_path,encoding='utf-8')
US_Unemployment.info()
US_Unemployment
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3621 entries, 0 to 3620 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 first_day_of_month 3621 non-null object 1 state 3621 non-null object 2 unemployment_rate 3621 non-null float64 dtypes: float64(1), object(2) memory usage: 85.0+ KB
| first_day_of_month | state | unemployment_rate | |
|---|---|---|---|
| 0 | 2017-01-01 | California | 5.2 |
| 1 | 2017-02-01 | California | 5.1 |
| 2 | 2017-03-01 | California | 5.0 |
| 3 | 2017-04-01 | California | 5.0 |
| 4 | 2017-05-01 | California | 4.9 |
| ... | ... | ... | ... |
| 3616 | 2022-07-01 | District of Columbia | 5.2 |
| 3617 | 2022-08-01 | District of Columbia | 5.1 |
| 3618 | 2022-09-01 | District of Columbia | 4.7 |
| 3619 | 2022-10-01 | District of Columbia | 4.8 |
| 3620 | 2022-11-01 | District of Columbia | 4.6 |
3621 rows × 3 columns
US_Unemployment.isna().sum()
first_day_of_month 0 state 0 unemployment_rate 0 dtype: int64
There are no NaN values, but the dataset covers years outside the scope of my research, so I remove all rows for 2017-2020 and 2022. I am only interested in 2021, because the Fortune 1000 2022 list refers to revenues earned in 2021.
I convert the column first_day_of_month to datetime, create a boolean mask selecting only the rows for the year 2021, and apply it to the dataframe.
US_Unemployment['first_day_of_month'] = pd.to_datetime(US_Unemployment['first_day_of_month'])
mask = US_Unemployment['first_day_of_month'].dt.year == 2021
US_Unemployment = US_Unemployment[mask].copy()  # .copy() avoids a SettingWithCopyWarning on later column assignments
US_Unemployment.info()
US_Unemployment
<class 'pandas.core.frame.DataFrame'> Int64Index: 612 entries, 48 to 3609 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 first_day_of_month 612 non-null datetime64[ns] 1 state 612 non-null object 2 unemployment_rate 612 non-null float64 dtypes: datetime64[ns](1), float64(1), object(1) memory usage: 19.1+ KB
| first_day_of_month | state | unemployment_rate | |
|---|---|---|---|
| 48 | 2021-01-01 | California | 8.8 |
| 49 | 2021-02-01 | California | 8.6 |
| 50 | 2021-03-01 | California | 8.4 |
| 51 | 2021-04-01 | California | 8.3 |
| 52 | 2021-05-01 | California | 7.9 |
| ... | ... | ... | ... |
| 3605 | 2021-08-01 | District of Columbia | 6.9 |
| 3606 | 2021-09-01 | District of Columbia | 6.5 |
| 3607 | 2021-10-01 | District of Columbia | 6.3 |
| 3608 | 2021-11-01 | District of Columbia | 6.1 |
| 3609 | 2021-12-01 | District of Columbia | 6.1 |
612 rows × 3 columns
Since I have monthly unemployment rates, I want to condense them into a single annual unemployment rate per state. I extract the year from the column first_day_of_month and group the data by state and year, taking the mean of the monthly rates, which I consider the most natural way to summarize the monthly data available. Since all remaining rows refer to 2021, I can then drop the year column. Finally, I rename the other columns.
US_Unemployment['year'] = US_Unemployment['first_day_of_month'].dt.year
US_Unemployment = US_Unemployment.groupby(['state', 'year'])['unemployment_rate'].mean().reset_index()
US_Unemployment.drop(columns=['year'], inplace=True)
US_Unemployment.rename(columns={'state': 'State', 'unemployment_rate': 'Unemployment'}, inplace=True)
US_Unemployment
| State | Unemployment | |
|---|---|---|
| 0 | Alabama | 3.450000 |
| 1 | Alaska | 6.425000 |
| 2 | Arizona | 4.950000 |
| 3 | Arkansas | 4.058333 |
| 4 | California | 7.366667 |
| 5 | Colorado | 5.408333 |
| 6 | Connecticut | 6.300000 |
| 7 | Delaware | 5.391667 |
| 8 | District of Columbia | 6.675000 |
| 9 | Florida | 4.600000 |
| 10 | Georgia | 3.941667 |
| 11 | Hawaii | 5.775000 |
| 12 | Idaho | 3.591667 |
| 13 | Illinois | 6.125000 |
| 14 | Indiana | 3.616667 |
| 15 | Iowa | 4.275000 |
| 16 | Kansas | 3.283333 |
| 17 | Kentucky | 4.708333 |
| 18 | Louisiana | 5.475000 |
| 19 | Maine | 4.633333 |
| 20 | Maryland | 5.783333 |
| 21 | Massachusetts | 5.758333 |
| 22 | Michigan | 5.933333 |
| 23 | Minnesota | 3.425000 |
| 24 | Mississippi | 5.625000 |
| 25 | Missouri | 4.375000 |
| 26 | Montana | 3.400000 |
| 27 | Nebraska | 2.516667 |
| 28 | Nevada | 7.250000 |
| 29 | New Hampshire | 3.533333 |
| 30 | New Jersey | 6.391667 |
| 31 | New Mexico | 6.808333 |
| 32 | New York | 6.958333 |
| 33 | North Carolina | 4.841667 |
| 34 | North Dakota | 3.683333 |
| 35 | Ohio | 5.166667 |
| 36 | Oklahoma | 3.891667 |
| 37 | Oregon | 5.241667 |
| 38 | Pennsylvania | 6.416667 |
| 39 | Rhode Island | 5.675000 |
| 40 | South Carolina | 4.016667 |
| 41 | South Dakota | 3.133333 |
| 42 | Tennessee | 4.316667 |
| 43 | Texas | 5.708333 |
| 44 | Utah | 2.716667 |
| 45 | Vermont | 3.450000 |
| 46 | Virginia | 3.925000 |
| 47 | Washington | 5.283333 |
| 48 | West Virginia | 5.075000 |
| 49 | Wisconsin | 3.816667 |
| 50 | Wyoming | 4.516667 |
I read and clean the last dataset. It contains data about US population by state from 2020 to 2022.
excel_file_path = 'NST-EST2022-POP.xlsx'
absolute_file_path = os.path.join(current_directory, excel_file_path)
US_Population = pd.read_excel(absolute_file_path,skiprows=3,skipfooter=7)
US_Population.info()
US_Population
<class 'pandas.core.frame.DataFrame'> RangeIndex: 56 entries, 0 to 55 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 56 non-null object 1 Unnamed: 1 56 non-null int64 2 2020 56 non-null int64 3 2021 56 non-null int64 4 2022 56 non-null int64 dtypes: int64(4), object(1) memory usage: 2.3+ KB
| Unnamed: 0 | Unnamed: 1 | 2020 | 2021 | 2022 | |
|---|---|---|---|---|---|
| 0 | United States | 331449520 | 331511512 | 332031554 | 333287557 |
| 1 | Northeast | 57609156 | 57448898 | 57259257 | 57040406 |
| 2 | Midwest | 68985537 | 68961043 | 68836505 | 68787595 |
| 3 | South | 126266262 | 126450613 | 127346029 | 128716192 |
| 4 | West | 78588565 | 78650958 | 78589763 | 78743364 |
| 5 | .Alabama | 5024356 | 5031362 | 5049846 | 5074296 |
| 6 | .Alaska | 733378 | 732923 | 734182 | 733583 |
| 7 | .Arizona | 7151507 | 7179943 | 7264877 | 7359197 |
| 8 | .Arkansas | 3011555 | 3014195 | 3028122 | 3045637 |
| 9 | .California | 39538245 | 39501653 | 39142991 | 39029342 |
| 10 | .Colorado | 5773733 | 5784865 | 5811297 | 5839926 |
| 11 | .Connecticut | 3605942 | 3597362 | 3623355 | 3626205 |
| 12 | .Delaware | 989957 | 992114 | 1004807 | 1018396 |
| 13 | .District of Columbia | 689546 | 670868 | 668791 | 671803 |
| 14 | .Florida | 21538226 | 21589602 | 21828069 | 22244823 |
| 15 | .Georgia | 10711937 | 10729828 | 10788029 | 10912876 |
| 16 | .Hawaii | 1455273 | 1451043 | 1447154 | 1440196 |
| 17 | .Idaho | 1839092 | 1849202 | 1904314 | 1939033 |
| 18 | .Illinois | 12812545 | 12786580 | 12686469 | 12582032 |
| 19 | .Indiana | 6785668 | 6788799 | 6813532 | 6833037 |
| 20 | .Iowa | 3190372 | 3190571 | 3197689 | 3200517 |
| 21 | .Kansas | 2937847 | 2937919 | 2937922 | 2937150 |
| 22 | .Kentucky | 4505893 | 4507445 | 4506589 | 4512310 |
| 23 | .Louisiana | 4657749 | 4651664 | 4627098 | 4590241 |
| 24 | .Maine | 1362341 | 1363557 | 1377238 | 1385340 |
| 25 | .Maryland | 6177213 | 6173205 | 6174610 | 6164660 |
| 26 | .Massachusetts | 7029949 | 6995729 | 6989690 | 6981974 |
| 27 | .Michigan | 10077325 | 10069577 | 10037504 | 10034113 |
| 28 | .Minnesota | 5706504 | 5709852 | 5711471 | 5717184 |
| 29 | .Mississippi | 2961288 | 2958141 | 2949586 | 2940057 |
| 30 | .Missouri | 6154920 | 6153998 | 6169823 | 6177957 |
| 31 | .Montana | 1084197 | 1087075 | 1106227 | 1122867 |
| 32 | .Nebraska | 1961489 | 1962642 | 1963554 | 1967923 |
| 33 | .Nevada | 3104624 | 3115648 | 3146402 | 3177772 |
| 34 | .New Hampshire | 1377518 | 1378587 | 1387505 | 1395231 |
| 35 | .New Jersey | 9289031 | 9271689 | 9267961 | 9261699 |
| 36 | .New Mexico | 2117527 | 2118390 | 2116677 | 2113344 |
| 37 | .New York | 20201230 | 20108296 | 19857492 | 19677151 |
| 38 | .North Carolina | 10439414 | 10449445 | 10565885 | 10698973 |
| 39 | .North Dakota | 779091 | 779518 | 777934 | 779261 |
| 40 | .Ohio | 11799374 | 11797517 | 11764342 | 11756058 |
| 41 | .Oklahoma | 3959346 | 3964912 | 3991225 | 4019800 |
| 42 | .Oregon | 4237291 | 4244795 | 4256301 | 4240137 |
| 43 | .Pennsylvania | 13002689 | 12994440 | 13012059 | 12972008 |
| 44 | .Rhode Island | 1097371 | 1096345 | 1096985 | 1093734 |
| 45 | .South Carolina | 5118429 | 5131848 | 5193266 | 5282634 |
| 46 | .South Dakota | 886677 | 887799 | 896164 | 909824 |
| 47 | .Tennessee | 6910786 | 6925619 | 6968351 | 7051339 |
| 48 | .Texas | 29145428 | 29232474 | 29558864 | 30029572 |
| 49 | .Utah | 3271614 | 3283785 | 3339113 | 3380800 |
| 50 | .Vermont | 643085 | 642893 | 646972 | 647064 |
| 51 | .Virginia | 8631384 | 8636471 | 8657365 | 8683619 |
| 52 | .Washington | 7705247 | 7724031 | 7740745 | 7785786 |
| 53 | .West Virginia | 1793755 | 1791420 | 1785526 | 1775156 |
| 54 | .Wisconsin | 5893725 | 5896271 | 5880101 | 5892539 |
| 55 | .Wyoming | 576837 | 577605 | 579483 | 581381 |
US_Population.isna().sum()
Unnamed: 0 0 Unnamed: 1 0 2020 0 2021 0 2022 0 dtype: int64
There are no missing values.
The unnamed columns represent, respectively, "Geographic Area" and "April 1, 2020 Estimates Base". They are unnamed because I skipped some rows of the Excel file to get the dataframe into a better format.
Since I only want data for individual states and not for geographic areas (such as the US as a whole or the Northeast, West, etc.), I rename the first column to "State" and drop the first five rows, which are not useful.
For the same reason, I also drop the "April 1, 2020 Estimates Base", 2020 and 2022 columns.
US_Population.drop(columns=['Unnamed: 1'], inplace=True)  # drop "April 1, 2020 Estimates Base"
US_Population = US_Population.drop(US_Population.columns[[1, 3]], axis=1)  # drop the 2020 and 2022 columns
US_Population = US_Population.drop(range(5), axis=0)  # drop the US total and the four region rows
US_Population.rename(columns={"Unnamed: 0": "State"}, inplace=True)
US_Population.head()
| State | 2021 | |
|---|---|---|
| 5 | .Alabama | 5049846 |
| 6 | .Alaska | 734182 |
| 7 | .Arizona | 7264877 |
| 8 | .Arkansas | 3028122 |
| 9 | .California | 39142991 |
Now I want to clean the values in the State column, which carry a leading "." that would interfere with merges on state names. I then reset the dataframe index so that it starts from 0 again rather than from 5.
US_Population['State'] = US_Population['State'].str.replace(".", "", regex=False)  # regex=False: treat "." literally and avoid the FutureWarning
US_Population.reset_index(drop=True, inplace=True)
US_Population
| State | 2021 | |
|---|---|---|
| 0 | Alabama | 5049846 |
| 1 | Alaska | 734182 |
| 2 | Arizona | 7264877 |
| 3 | Arkansas | 3028122 |
| 4 | California | 39142991 |
| 5 | Colorado | 5811297 |
| 6 | Connecticut | 3623355 |
| 7 | Delaware | 1004807 |
| 8 | District of Columbia | 668791 |
| 9 | Florida | 21828069 |
| 10 | Georgia | 10788029 |
| 11 | Hawaii | 1447154 |
| 12 | Idaho | 1904314 |
| 13 | Illinois | 12686469 |
| 14 | Indiana | 6813532 |
| 15 | Iowa | 3197689 |
| 16 | Kansas | 2937922 |
| 17 | Kentucky | 4506589 |
| 18 | Louisiana | 4627098 |
| 19 | Maine | 1377238 |
| 20 | Maryland | 6174610 |
| 21 | Massachusetts | 6989690 |
| 22 | Michigan | 10037504 |
| 23 | Minnesota | 5711471 |
| 24 | Mississippi | 2949586 |
| 25 | Missouri | 6169823 |
| 26 | Montana | 1106227 |
| 27 | Nebraska | 1963554 |
| 28 | Nevada | 3146402 |
| 29 | New Hampshire | 1387505 |
| 30 | New Jersey | 9267961 |
| 31 | New Mexico | 2116677 |
| 32 | New York | 19857492 |
| 33 | North Carolina | 10565885 |
| 34 | North Dakota | 777934 |
| 35 | Ohio | 11764342 |
| 36 | Oklahoma | 3991225 |
| 37 | Oregon | 4256301 |
| 38 | Pennsylvania | 13012059 |
| 39 | Rhode Island | 1096985 |
| 40 | South Carolina | 5193266 |
| 41 | South Dakota | 896164 |
| 42 | Tennessee | 6968351 |
| 43 | Texas | 29558864 |
| 44 | Utah | 3339113 |
| 45 | Vermont | 646972 |
| 46 | Virginia | 8657365 |
| 47 | Washington | 7740745 |
| 48 | West Virginia | 1785526 |
| 49 | Wisconsin | 5880101 |
| 50 | Wyoming | 579483 |
To make the US data easier to inspect, I merge the dataframes into a single one.
US_merged = pd.merge(US_States, US_GDP_State, on='State')
US_merged = pd.merge(US_merged, US_Unemployment, on='State')
US_merged = pd.merge(US_merged, US_Population, on='State')
US_merged.info()
US_merged
<class 'pandas.core.frame.DataFrame'> Int64Index: 51 entries, 0 to 50 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 State 51 non-null object 1 Abbreviation 51 non-null object 2 GDP_2021 51 non-null float64 3 Unemployment 51 non-null float64 4 2021 51 non-null int64 dtypes: float64(2), int64(1), object(2) memory usage: 2.4+ KB
| State | Abbreviation | GDP_2021 | Unemployment | 2021 | |
|---|---|---|---|---|---|
| 0 | Alabama | AL | 254109.7 | 3.450000 | 5049846 |
| 1 | Alaska | AK | 57349.4 | 6.425000 | 734182 |
| 2 | Arizona | AZ | 420026.7 | 4.950000 | 7264877 |
| 3 | Arkansas | AR | 148676.1 | 4.058333 | 3028122 |
| 4 | California | CA | 3373240.7 | 7.366667 | 39142991 |
| 5 | Colorado | CO | 436359.5 | 5.408333 | 5811297 |
| 6 | Connecticut | CT | 298395.2 | 6.300000 | 3623355 |
| 7 | Delaware | DE | 81160.0 | 5.391667 | 1004807 |
| 8 | District of Columbia | DC | 153670.5 | 6.675000 | 668791 |
| 9 | Florida | FL | 1255558.3 | 4.600000 | 21828069 |
| 10 | Georgia | GA | 691626.9 | 3.941667 | 10788029 |
| 11 | Hawaii | HI | 91096.1 | 5.775000 | 1447154 |
| 12 | Idaho | ID | 96282.8 | 3.591667 | 1904314 |
| 13 | Illinois | IL | 945673.8 | 6.125000 | 12686469 |
| 14 | Indiana | IN | 412975.2 | 3.616667 | 6813532 |
| 15 | Iowa | IA | 216860.2 | 4.275000 | 3197689 |
| 16 | Kansas | KS | 191380.6 | 3.283333 | 2937922 |
| 17 | Kentucky | KY | 237182.0 | 4.708333 | 4506589 |
| 18 | Louisiana | LA | 258571.3 | 5.475000 | 4627098 |
| 19 | Maine | ME | 77963.3 | 4.633333 | 1377238 |
| 20 | Montana | MT | 58699.8 | 3.400000 | 1106227 |
| 21 | Nebraska | NE | 146285.4 | 2.516667 | 1963554 |
| 22 | Nevada | NV | 194486.6 | 7.250000 | 3146402 |
| 23 | New Hampshire | NH | 99673.3 | 3.533333 | 1387505 |
| 24 | New Jersey | NJ | 682945.9 | 6.391667 | 9267961 |
| 25 | New Mexico | NM | 109582.8 | 6.808333 | 2116677 |
| 26 | New York | NY | 1901296.5 | 6.958333 | 19857492 |
| 27 | North Carolina | NC | 662120.8 | 4.841667 | 10565885 |
| 28 | North Dakota | ND | 63559.6 | 3.683333 | 777934 |
| 29 | Ohio | OH | 756617.2 | 5.166667 | 11764342 |
| 30 | Oklahoma | OK | 215336.3 | 3.891667 | 3991225 |
| 31 | Oregon | OR | 272190.9 | 5.241667 | 4256301 |
| 32 | Maryland | MD | 443929.9 | 5.783333 | 6174610 |
| 33 | Massachusetts | MA | 641332.2 | 5.758333 | 6989690 |
| 34 | Michigan | MI | 572205.8 | 5.933333 | 10037504 |
| 35 | Minnesota | MN | 412458.6 | 3.425000 | 5711471 |
| 36 | Mississippi | MS | 127307.7 | 5.625000 | 2949586 |
| 37 | Missouri | MO | 358572.0 | 4.375000 | 6169823 |
| 38 | Pennsylvania | PA | 844496.5 | 6.416667 | 13012059 |
| 39 | Rhode Island | RI | 66570.9 | 5.675000 | 1096985 |
| 40 | South Carolina | SC | 269802.5 | 4.016667 | 5193266 |
| 41 | South Dakota | SD | 61684.7 | 3.133333 | 896164 |
| 42 | Tennessee | TN | 427125.5 | 4.316667 | 6968351 |
| 43 | Texas | TX | 2051768.6 | 5.708333 | 29558864 |
| 44 | Utah | UT | 225340.3 | 2.716667 | 3339113 |
| 45 | Vermont | VT | 37103.8 | 3.450000 | 646972 |
| 46 | Virginia | VA | 604957.6 | 3.925000 | 8657365 |
| 47 | Washington | WA | 677489.5 | 5.283333 | 7740745 |
| 48 | West Virginia | WV | 85434.2 | 5.075000 | 1785526 |
| 49 | Wisconsin | WI | 368611.3 | 3.816667 | 5880101 |
| 50 | Wyoming | WY | 41510.2 | 4.516667 | 579483 |
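An aside on the merges above: `pd.merge` defaults to an inner join, so any state whose name differs between sources (for example, a stray trailing space) would silently drop out. A minimal sketch, on toy dataframes I made up for illustration, of how the `validate` and `indicator` parameters can surface such mismatches:

```python
import pandas as pd

# Toy stand-ins for US_States and US_GDP_State (values are made up for the example)
left = pd.DataFrame({'State': ['Alabama', 'Alaska'], 'Abbreviation': ['AL', 'AK']})
right = pd.DataFrame({'State': ['Alabama', 'Alaska '],  # note the stray trailing space
                      'GDP_2021': [254109.7, 57349.4]})

# validate='one_to_one' raises if either key column contains duplicates;
# indicator=True adds a '_merge' column flagging rows that matched only one side
merged = pd.merge(left, right, on='State', how='outer',
                  validate='one_to_one', indicator=True)
mismatches = merged[merged['_merge'] != 'both']
print(mismatches[['State', '_merge']])
```

In this project all three merges happen to keep 51 rows, which is itself a useful check that every state name matched.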
And I rename some columns.
# Assign descriptive names to all five columns at once (avoids mutating columns.values in place)
US_merged.columns = ['State', 'Abbreviation', 'GDP($)', 'Unemployment(%)', 'Population']
US_merged
| State | Abbreviation | GDP($) | Unemployment(%) | Population | |
|---|---|---|---|---|---|
| 0 | Alabama | AL | 254109.7 | 3.450000 | 5049846 |
| 1 | Alaska | AK | 57349.4 | 6.425000 | 734182 |
| 2 | Arizona | AZ | 420026.7 | 4.950000 | 7264877 |
| 3 | Arkansas | AR | 148676.1 | 4.058333 | 3028122 |
| 4 | California | CA | 3373240.7 | 7.366667 | 39142991 |
| 5 | Colorado | CO | 436359.5 | 5.408333 | 5811297 |
| 6 | Connecticut | CT | 298395.2 | 6.300000 | 3623355 |
| 7 | Delaware | DE | 81160.0 | 5.391667 | 1004807 |
| 8 | District of Columbia | DC | 153670.5 | 6.675000 | 668791 |
| 9 | Florida | FL | 1255558.3 | 4.600000 | 21828069 |
| 10 | Georgia | GA | 691626.9 | 3.941667 | 10788029 |
| 11 | Hawaii | HI | 91096.1 | 5.775000 | 1447154 |
| 12 | Idaho | ID | 96282.8 | 3.591667 | 1904314 |
| 13 | Illinois | IL | 945673.8 | 6.125000 | 12686469 |
| 14 | Indiana | IN | 412975.2 | 3.616667 | 6813532 |
| 15 | Iowa | IA | 216860.2 | 4.275000 | 3197689 |
| 16 | Kansas | KS | 191380.6 | 3.283333 | 2937922 |
| 17 | Kentucky | KY | 237182.0 | 4.708333 | 4506589 |
| 18 | Louisiana | LA | 258571.3 | 5.475000 | 4627098 |
| 19 | Maine | ME | 77963.3 | 4.633333 | 1377238 |
| 20 | Montana | MT | 58699.8 | 3.400000 | 1106227 |
| 21 | Nebraska | NE | 146285.4 | 2.516667 | 1963554 |
| 22 | Nevada | NV | 194486.6 | 7.250000 | 3146402 |
| 23 | New Hampshire | NH | 99673.3 | 3.533333 | 1387505 |
| 24 | New Jersey | NJ | 682945.9 | 6.391667 | 9267961 |
| 25 | New Mexico | NM | 109582.8 | 6.808333 | 2116677 |
| 26 | New York | NY | 1901296.5 | 6.958333 | 19857492 |
| 27 | North Carolina | NC | 662120.8 | 4.841667 | 10565885 |
| 28 | North Dakota | ND | 63559.6 | 3.683333 | 777934 |
| 29 | Ohio | OH | 756617.2 | 5.166667 | 11764342 |
| 30 | Oklahoma | OK | 215336.3 | 3.891667 | 3991225 |
| 31 | Oregon | OR | 272190.9 | 5.241667 | 4256301 |
| 32 | Maryland | MD | 443929.9 | 5.783333 | 6174610 |
| 33 | Massachusetts | MA | 641332.2 | 5.758333 | 6989690 |
| 34 | Michigan | MI | 572205.8 | 5.933333 | 10037504 |
| 35 | Minnesota | MN | 412458.6 | 3.425000 | 5711471 |
| 36 | Mississippi | MS | 127307.7 | 5.625000 | 2949586 |
| 37 | Missouri | MO | 358572.0 | 4.375000 | 6169823 |
| 38 | Pennsylvania | PA | 844496.5 | 6.416667 | 13012059 |
| 39 | Rhode Island | RI | 66570.9 | 5.675000 | 1096985 |
| 40 | South Carolina | SC | 269802.5 | 4.016667 | 5193266 |
| 41 | South Dakota | SD | 61684.7 | 3.133333 | 896164 |
| 42 | Tennessee | TN | 427125.5 | 4.316667 | 6968351 |
| 43 | Texas | TX | 2051768.6 | 5.708333 | 29558864 |
| 44 | Utah | UT | 225340.3 | 2.716667 | 3339113 |
| 45 | Vermont | VT | 37103.8 | 3.450000 | 646972 |
| 46 | Virginia | VA | 604957.6 | 3.925000 | 8657365 |
| 47 | Washington | WA | 677489.5 | 5.283333 | 7740745 |
| 48 | West Virginia | WV | 85434.2 | 5.075000 | 1785526 |
| 49 | Wisconsin | WI | 368611.3 | 3.816667 | 5880101 |
| 50 | Wyoming | WY | 41510.2 | 4.516667 | 579483 |
As mentioned earlier, I want to add the assets column from the secondary Fortune 1000 dataframe so that I can carry out a more complete analysis of Fortune 1000 companies.
# Both dataframes follow the same Fortune ranking, so rows align positionally after the reset
Fortune_1000.reset_index(drop=True, inplace=True)
Fortune_1000_sec.reset_index(drop=True, inplace=True)
Fortune_1000["Assets"] = Fortune_1000_sec["assets"]
Fortune_1000.info()
Fortune_1000
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Company 1000 non-null object 1 Revenue 1000 non-null float64 2 Profit 1000 non-null float64 3 Employees 1000 non-null int64 4 Sector 1000 non-null object 5 City 1000 non-null object 6 State 1000 non-null object 7 New_Entry 1000 non-null object 8 CEO_Founder 1000 non-null object 9 CEO_Woman 1000 non-null object 10 Profitable 1000 non-null object 11 CEO 1000 non-null object 12 Website 1000 non-null object 13 Assets 1000 non-null float64 dtypes: float64(3), int64(1), object(10) memory usage: 109.5+ KB
| Company | Revenue | Profit | Employees | Sector | City | State | New_Entry | CEO_Founder | CEO_Woman | Profitable | CEO | Website | Assets | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | 572754.0 | 13673.0 | 2300000 | Retailing | Bentonville | AR | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com | 244860.0 |
| 1 | Amazon | 469822.0 | 33364.0 | 1608000 | Retailing | Seattle | WA | no | no | no | yes | Andrew R. Jassy | www.amazon.com | 420549.0 |
| 2 | Apple | 365817.0 | 94680.0 | 154000 | Technology | Cupertino | CA | no | no | no | yes | Timothy D. Cook | www.apple.com | 351002.0 |
| 3 | CVS Health | 292111.0 | 7910.0 | 258000 | Health Care | Woonsocket | RI | no | no | yes | yes | Karen Lynch | https://www.cvshealth.com | 232999.0 |
| 4 | UnitedHealth Group | 287597.0 | 17285.0 | 350000 | Health Care | Minnetonka | MN | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com | 212206.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Vizio Holding | 2124.0 | -39.4 | 800 | Industrials | Irvine | CA | no | yes | no | no | William W. Wang | https://www.vizio.com | 935.8 |
| 996 | 1-800-Flowers.com | 2122.2 | 118.7 | 4800 | Retailing | Jericho | NY | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com | 1076.7 |
| 997 | Cowen | 2112.8 | 295.6 | 1534 | Financials | New York | NY | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com | 8748.8 |
| 998 | Ashland | 2111.0 | 220.0 | 4100 | Chemicals | Wilmington | DE | no | no | no | yes | Guillermo Novo | https://www.ashland.com | 6612.0 |
| 999 | DocuSign | 2107.2 | -70.0 | 7461 | Technology | San Francisco | CA | no | no | no | no | Allan C. Thygesen | https://www.docusign.com | 2541.3 |
1000 rows × 14 columns
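The positional assignment above works only because both dataframes list companies in exactly the same rank order. A more defensive sketch joins on the company name instead, so row order no longer matters (the toy data below is made up; column names follow this project's conventions):

```python
import pandas as pd

# Minimal stand-ins for the main and secondary Fortune dataframes
main = pd.DataFrame({'Company': ['Walmart', 'Amazon'],
                     'Revenue': [572754.0, 469822.0]})
secondary = pd.DataFrame({'company': ['Amazon', 'Walmart'],   # note: different order
                          'assets': [420549.0, 244860.0]})

# Left-join on the company name so each Assets value lands on the right row
merged = main.merge(
    secondary.rename(columns={'company': 'Company', 'assets': 'Assets'}),
    on='Company', how='left')
print(merged)
```

A left join also makes any company missing from the secondary dataset visible as a NaN in Assets, rather than silently misaligning values.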
For legibility, I reorder the columns of the dataframe.
new_column_order = ['Company', 'Revenue', 'Assets','Profit','Employees', 'Sector', 'State','City', 'CEO_Founder', 'CEO_Woman','New_Entry', 'Profitable', 'CEO', 'Website']
Fortune_1000 = Fortune_1000.reindex(columns=new_column_order)
Fortune_1000
| Company | Revenue | Assets | Profit | Employees | Sector | State | City | CEO_Founder | CEO_Woman | New_Entry | Profitable | CEO | Website | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | 572754.0 | 244860.0 | 13673.0 | 2300000 | Retailing | AR | Bentonville | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com |
| 1 | Amazon | 469822.0 | 420549.0 | 33364.0 | 1608000 | Retailing | WA | Seattle | no | no | no | yes | Andrew R. Jassy | www.amazon.com |
| 2 | Apple | 365817.0 | 351002.0 | 94680.0 | 154000 | Technology | CA | Cupertino | no | no | no | yes | Timothy D. Cook | www.apple.com |
| 3 | CVS Health | 292111.0 | 232999.0 | 7910.0 | 258000 | Health Care | RI | Woonsocket | no | yes | no | yes | Karen Lynch | https://www.cvshealth.com |
| 4 | UnitedHealth Group | 287597.0 | 212206.0 | 17285.0 | 350000 | Health Care | MN | Minnetonka | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Vizio Holding | 2124.0 | 935.8 | -39.4 | 800 | Industrials | CA | Irvine | yes | no | no | no | William W. Wang | https://www.vizio.com |
| 996 | 1-800-Flowers.com | 2122.2 | 1076.7 | 118.7 | 4800 | Retailing | NY | Jericho | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com |
| 997 | Cowen | 2112.8 | 8748.8 | 295.6 | 1534 | Financials | NY | New York | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com |
| 998 | Ashland | 2111.0 | 6612.0 | 220.0 | 4100 | Chemicals | DE | Wilmington | no | no | no | yes | Guillermo Novo | https://www.ashland.com |
| 999 | DocuSign | 2107.2 | 2541.3 | -70.0 | 7461 | Technology | CA | San Francisco | no | no | no | no | Allan C. Thygesen | https://www.docusign.com |
1000 rows × 14 columns
To start answering my research questions, I want to understand which companies generate the highest revenues and how revenues are distributed overall. I use interactive histograms and box plots from the plotly library.
top_10_revenues = Fortune_1000.nlargest(10, 'Revenue')
fig = px.bar(top_10_revenues, x='Company', y='Revenue', labels={'Revenue': 'Revenues (millions $)'}, title='Top 10 Companies by Revenues')
fig.update_layout(xaxis_tickangle=-45)
fig.show()
fig = px.histogram(Fortune_1000, x='Revenue', nbins=50, labels={'Revenue': 'Revenues (millions $)'}, title='Distribution of Revenues')
fig.update_layout(xaxis_range=[0, Fortune_1000['Revenue'].max()], xaxis_tickangle=-45)
fig.show()
fig = px.box(Fortune_1000, x='Revenue')
fig.update_layout(title='Distribution of Revenues', xaxis_title='Revenues (millions $)', yaxis_title='Distribution')
fig.show()
As we can see, there are a few outliers in terms of revenues, but most companies fall in the range below $20 billion.
Now I want to understand which sectors dominate the Fortune 1000 list. I use a histogram to show the distribution of companies by sector, and a scatter plot with revenues on a logarithmic scale, which produces a much more readable graph when there are outliers or large gaps between the highest and lowest values.
sector_summary = Fortune_1000['Sector'].value_counts()
fig = px.bar(sector_summary, x=sector_summary.index, y=sector_summary.values, color=sector_summary.index)
fig.update_layout(xaxis_title='Sector', yaxis_title='Frequency', title='Distribution of Companies by Sector',
xaxis_tickangle=-90)
fig.show()
revenues_by_sector = Fortune_1000.groupby('Sector')['Revenue'].sum()
sectors = revenues_by_sector.index
total_revenues = revenues_by_sector.values
unique_sectors = Fortune_1000['Sector'].unique()
colors = plt.cm.tab20(np.linspace(0, 1, len(unique_sectors)))
plt.figure(figsize=(12, 6))
for sector, color in zip(unique_sectors, colors):
sector_data = Fortune_1000[Fortune_1000['Sector'] == sector]
frequencies = np.arange(1, len(sector_data) + 1)
plt.scatter(frequencies, sector_data['Revenue'], color=color, label=sector)
plt.xlabel('Frequency')
plt.ylabel('Revenues (millions $)')
plt.title('Companies Revenues by Sector')
plt.xticks(rotation=90)
plt.yscale('log')
plt.legend()
plt.tight_layout()
plt.show()
I use a log scale so that I can better visualize the data. This is possible because revenues are strictly positive (it would not work for profits, which include losses).
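For completeness: if one did want a log-like axis for a quantity that can be negative, such as profits, matplotlib offers the 'symlog' scale, which is linear near zero and logarithmic beyond a chosen threshold. A minimal sketch with made-up profit values:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up profits, including losses, spanning several orders of magnitude
values = np.array([-5000.0, -50.0, 0.0, 50.0, 5000.0])

fig, ax = plt.subplots()
ax.scatter(np.arange(len(values)), values)
ax.set_yscale('symlog', linthresh=100)  # linear within +/-100, logarithmic outside
ax.set_ylabel('Profit (millions $)')
```

This is only an alternative worth knowing about; the plain linear scale used below for profits remains a reasonable choice.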
It seems that the sectors with the largest number of companies also dominate in terms of revenues, but I want to verify this more rigorously.
I compute the total revenues for each sector.
revenues_by_sector = Fortune_1000.groupby('Sector')['Revenue'].sum()
sectors = revenues_by_sector.index
total_revenues = revenues_by_sector.values
fig = px.bar(x=sectors, y=total_revenues, color=sectors)
fig.update_layout(
title='Total Revenues by Sector',
xaxis=dict(title='Sector'),
yaxis=dict(title='Total Revenues (millions $)'),
xaxis_tickangle=-45
)
fig.show()
fig = go.Figure(data=[go.Pie(labels=sectors, values=total_revenues, marker=dict(colors=colors), textinfo='label')])
fig.update_layout(title='Total Revenues by Sector')
fig.show()
Financials, Retailing, Health Care, Technology and Energy are the undisputed leaders of the Fortune 1000 (and, I would add, of the economy in general) in terms of total revenues.
Now I want to add profits to the analysis and study the correlation between revenues and profits. Here I cannot use a log scale, because profits can be negative (and some companies report losses), so the graph is less readable.
Using scipy.stats.linregress, I compute the regression line describing the relationship between revenues and profits. I also report the correlation coefficient (r), a numerical measure of the direction and strength of the linear association, and R², which measures how well the regression line fits the data.
revenues = Fortune_1000['Revenue']
profits = Fortune_1000['Profit']
slope, intercept, r_value, p_value, std_err = stats.linregress(revenues, profits)
regression_line = lambda x: slope * x + intercept
correlation_coefficient = r_value  # linregress's r_value is the Pearson correlation coefficient
coefficient_of_determination = r_value**2
scatter_trace = go.Scatter(
x=revenues,
y=profits,
mode='markers',
marker=dict(color='blue', opacity=0.5),
name='Data'
)
regression_trace = go.Scatter(
x=revenues,
y=regression_line(revenues),
mode='lines',
line=dict(color='red'),
name='Regression Line'
)
layout = go.Layout(
title='Revenues vs Profits',
xaxis=dict(title='Revenues (millions $)'),
yaxis=dict(title='Profits (millions $)'),
showlegend=False,
annotations=[
dict(
x=0.05,
y=0.95,
xref='paper',
yref='paper',
text=f'y = {slope:.2f}x + {intercept:.2f}',
showarrow=False,
font=dict(size=12),
align='left'
),
dict(
x=0.05,
y=0.90,
xref='paper',
yref='paper',
text=f'Correlation (r): {correlation_coefficient:.2f}',
showarrow=False,
font=dict(size=12),
align='left'
),
dict(
x=0.05,
y=0.85,
xref='paper',
yref='paper',
text=f'Squared correlation (R^2): {coefficient_of_determination:.2f}',
showarrow=False,
font=dict(size=12),
align='left'
)
]
)
fig = go.Figure(data=[scatter_trace, regression_trace], layout=layout)
fig.show()
There is a moderate positive correlation between revenues and profits (r = 0.65), and the linear model explains a moderate share of the variance (R² = 0.42).
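As a sanity check on the two statistics reported above: for simple linear regression, the `rvalue` returned by `linregress` is exactly Pearson's r, and R² is its square. A toy verification on synthetic data:

```python
import numpy as np
from scipy import stats

# Synthetic noisy linear relation standing in for revenues vs profits
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

res = stats.linregress(x, y)
r = np.corrcoef(x, y)[0, 1]

# rvalue from linregress equals the Pearson correlation coefficient,
# so R^2 can be obtained as rvalue squared
assert np.isclose(res.rvalue, r)
print(res.rvalue**2)  # coefficient of determination R^2
```

This is why computing the correlation separately with `np.corrcoef` is redundant here: both numbers come from the same quantity.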
I also want to analyse the distribution of profits, so I use histograms and density plots.
top_10_profitable = Fortune_1000.nlargest(10, 'Profit')
bar_trace = go.Bar(
x=top_10_profitable['Company'],
y=top_10_profitable['Profit'],
marker=dict(color='blue')
)
layout = go.Layout(
title='Top 10 Most Profitable Companies',
xaxis=dict(title='Company'),
yaxis=dict(title='Profits (millions $)'),
showlegend=False,
xaxis_tickangle=-90
)
fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()
histogram_trace = go.Histogram(
x=Fortune_1000['Profit'],
nbinsx=50,
marker=dict(color='blue', line=dict(color='black', width=1))
)
layout = go.Layout(
title='Distribution of Profits',
xaxis=dict(title='Profits (millions $)'),
yaxis=dict(title='Frequency'),
showlegend=False
)
fig = go.Figure(data=[histogram_trace], layout=layout)
fig.show()
plt.figure(figsize=(10, 6))
sns.kdeplot(Fortune_1000['Profit'], fill=True)
plt.xlabel('Profits (millions $)')
plt.ylabel('Density')
plt.title('Density Plot of Profits')
plt.show()
I can also analyse profitability across sectors to see which ones are the most profitable. Let's look at some more graphs.
profits_by_sector = Fortune_1000.groupby('Sector')['Profit'].sum()
sectors = profits_by_sector.index
total_profits = profits_by_sector.values
colors = [
'#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
'#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
'#b0c4de', '#ffb6c1', '#c0c0c0', '#66cdaa', '#ffa500',
'#ffd700', '#d2691e', '#40e0d0', '#808080', '#008080',
'#000080', '#800080', '#008000', '#800000', '#0000ff'
]
bar_trace = go.Bar(
x=sectors,
y=total_profits,
marker=dict(color=colors[:len(sectors)])
)
layout = go.Layout(
title='Total Profits by Sector',
xaxis=dict(title='Sector'),
yaxis=dict(title='Profit'),
xaxis_tickangle=-45
)
fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()
# Reuse profits_by_sector, sectors, total_profits, and the colour palette from above
pie_trace = go.Pie(
labels=sectors,
values=total_profits,
marker=dict(colors=colors[:len(sectors)]),
textinfo='none'
)
layout = go.Layout(
title='Total Profits by Sector'
)
fig = go.Figure(data=[pie_trace], layout=layout)
fig.show()
As expected, the sectors leading in revenues are also the most profitable: Financials, Technology, Health Care, Energy, and Retailing.
Now let's take a look at profit margins. Profit Margin = (Net Profit / Revenue) * 100
It is a financial metric that measures a company's profitability, expressed as a percentage: the portion of profit earned for each dollar of revenue. The higher, the better: a higher profit margin indicates that a company generates more profit relative to its revenues, which is generally a positive sign of financial health and efficiency.
I add a profit margin column to my dataframe Fortune_1000.
Fortune_1000['Profit_Margin(%)'] = (Fortune_1000['Profit'] / Fortune_1000['Revenue']) * 100
Fortune_1000
| Company | Revenue | Assets | Profit | Employees | Sector | State | City | CEO_Founder | CEO_Woman | New_Entry | Profitable | CEO | Website | Profit_Margin(%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | 572754.0 | 244860.0 | 13673.0 | 2300000 | Retailing | AR | Bentonville | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com | 2.387238 |
| 1 | Amazon | 469822.0 | 420549.0 | 33364.0 | 1608000 | Retailing | WA | Seattle | no | no | no | yes | Andrew R. Jassy | www.amazon.com | 7.101413 |
| 2 | Apple | 365817.0 | 351002.0 | 94680.0 | 154000 | Technology | CA | Cupertino | no | no | no | yes | Timothy D. Cook | www.apple.com | 25.881793 |
| 3 | CVS Health | 292111.0 | 232999.0 | 7910.0 | 258000 | Health Care | RI | Woonsocket | no | yes | no | yes | Karen Lynch | https://www.cvshealth.com | 2.707875 |
| 4 | UnitedHealth Group | 287597.0 | 212206.0 | 17285.0 | 350000 | Health Care | MN | Minnetonka | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com | 6.010146 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Vizio Holding | 2124.0 | 935.8 | -39.4 | 800 | Industrials | CA | Irvine | yes | no | no | no | William W. Wang | https://www.vizio.com | -1.854991 |
| 996 | 1-800-Flowers.com | 2122.2 | 1076.7 | 118.7 | 4800 | Retailing | NY | Jericho | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com | 5.593252 |
| 997 | Cowen | 2112.8 | 8748.8 | 295.6 | 1534 | Financials | NY | New York | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com | 13.990913 |
| 998 | Ashland | 2111.0 | 6612.0 | 220.0 | 4100 | Chemicals | DE | Wilmington | no | no | no | yes | Guillermo Novo | https://www.ashland.com | 10.421601 |
| 999 | DocuSign | 2107.2 | 2541.3 | -70.0 | 7461 | Technology | CA | San Francisco | no | no | no | no | Allan C. Thygesen | https://www.docusign.com | -3.321944 |
1000 rows × 15 columns
profit_margins_by_sector = Fortune_1000.groupby('Sector')['Profit_Margin(%)'].mean()
sectors = profit_margins_by_sector.index
margin_values = profit_margins_by_sector.values
# Reuse the colour palette defined above
bar_trace = go.Bar(
x=sectors,
y=margin_values,
marker=dict(color=colors[:len(sectors)])
)
layout = go.Layout(
xaxis=dict(title='Sector'),
yaxis=dict(title='Profit Margin (%)'),
title='Profit Margins by Sector',
xaxis_tickangle=-45
)
fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()
Analysing profit margins by sector, we can see that Retailing, Health Care, Technology, and Energy (the undisputed leaders in total revenues and profits) perform roughly in line with the average of the other sectors. The Financials sector, however, remains the clear best performer, with an average profit margin about 5 percentage points higher than any other sector.
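To double-check the sector ranking, the per-sector means can simply be sorted. A minimal sketch, run here on a made-up mini-frame with the same column names as Fortune_1000 (on the real data, the same lines applied to Fortune_1000 put Financials on top):

```python
import pandas as pd

# Illustrative sample data (values are made up) mirroring the Fortune_1000 columns
sample = pd.DataFrame({
    'Sector': ['Financials', 'Financials', 'Retailing', 'Technology'],
    'Profit_Margin(%)': [18.0, 14.0, 3.0, 9.0],
})

# Mean margin per sector, highest first
ranked = sample.groupby('Sector')['Profit_Margin(%)'].mean().sort_values(ascending=False)
print(ranked)
```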
Another interesting financial ratio is ROA (Return on Assets), which evaluates a company's efficiency in generating profits from its assets. ROA = Net Income / Average Total Assets
Profits can generally be treated as equivalent to net income, and since the dataset only reports total assets at a single point in time, I use those in place of average total assets. With that in mind, I perform the analysis using the data in the dataframe Fortune_1000.
Fortune_1000['ROA(%)'] = (Fortune_1000['Profit'] / Fortune_1000['Assets']) * 100
Fortune_1000
| Company | Revenue | Assets | Profit | Employees | Sector | State | City | CEO_Founder | CEO_Woman | New_Entry | Profitable | CEO | Website | Profit_Margin(%) | ROA(%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | 572754.0 | 244860.0 | 13673.0 | 2300000 | Retailing | AR | Bentonville | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com | 2.387238 | 5.584007 |
| 1 | Amazon | 469822.0 | 420549.0 | 33364.0 | 1608000 | Retailing | WA | Seattle | no | no | no | yes | Andrew R. Jassy | www.amazon.com | 7.101413 | 7.933439 |
| 2 | Apple | 365817.0 | 351002.0 | 94680.0 | 154000 | Technology | CA | Cupertino | no | no | no | yes | Timothy D. Cook | www.apple.com | 25.881793 | 26.974205 |
| 3 | CVS Health | 292111.0 | 232999.0 | 7910.0 | 258000 | Health Care | RI | Woonsocket | no | yes | no | yes | Karen Lynch | https://www.cvshealth.com | 2.707875 | 3.394864 |
| 4 | UnitedHealth Group | 287597.0 | 212206.0 | 17285.0 | 350000 | Health Care | MN | Minnetonka | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com | 6.010146 | 8.145387 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Vizio Holding | 2124.0 | 935.8 | -39.4 | 800 | Industrials | CA | Irvine | yes | no | no | no | William W. Wang | https://www.vizio.com | -1.854991 | -4.210301 |
| 996 | 1-800-Flowers.com | 2122.2 | 1076.7 | 118.7 | 4800 | Retailing | NY | Jericho | no | no | no | yes | Christopher G. McCann | https://www.1800flowers.com | 5.593252 | 11.024426 |
| 997 | Cowen | 2112.8 | 8748.8 | 295.6 | 1534 | Financials | NY | New York | no | no | no | yes | Jeffrey Solomon | https://www.cowen.com | 13.990913 | 3.378749 |
| 998 | Ashland | 2111.0 | 6612.0 | 220.0 | 4100 | Chemicals | DE | Wilmington | no | no | no | yes | Guillermo Novo | https://www.ashland.com | 10.421601 | 3.327284 |
| 999 | DocuSign | 2107.2 | 2541.3 | -70.0 | 7461 | Technology | CA | San Francisco | no | no | no | no | Allan C. Thygesen | https://www.docusign.com | -3.321944 | -2.754496 |
1000 rows × 16 columns
Before analysing ROA, I want to take a look at the total assets data, starting with the top 10 companies by total assets.
top_10_assets = Fortune_1000.nlargest(10, 'Assets')
bar_trace = go.Bar(
x=top_10_assets['Company'],
y=top_10_assets['Assets']
)
layout = go.Layout(
xaxis=dict(title='Company'),
yaxis=dict(title='Total Assets'),
title='Top 10 Companies by Total Assets',
xaxis_tickangle=-45
)
fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()
As can be seen, all top 10 companies by total assets are from the Financials sector, with total assets in the trillions of dollars. All of them provide financial services: banks, investment banks, insurance companies, and holdings. Many of them are also United States government-sponsored. Now, as done with revenues and profits, I want to look at total assets by sector.
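The sector composition of the top-10 subset can also be verified programmatically with value_counts. A minimal sketch on made-up sample data mirroring the relevant Fortune_1000 columns (on the real data, all 10 entries come out as Financials):

```python
import pandas as pd

# Made-up sample standing in for Fortune_1000
sample = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Sector': ['Financials', 'Financials', 'Technology', 'Financials'],
    'Assets': [3000.0, 2500.0, 400.0, 2000.0],
})

# Take the largest companies by assets and count their sectors
top_3 = sample.nlargest(3, 'Assets')
print(top_3['Sector'].value_counts())
```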
assets_by_sector = Fortune_1000.groupby('Sector')['Assets'].sum().reset_index()
fig = px.bar(assets_by_sector, x='Sector', y='Assets', color='Sector', title='Total Assets by Sector',
labels={'Assets': 'Total Assets'}, color_discrete_sequence=colors)
fig.update_layout(xaxis_tickangle=-45)
fig.show()
assets_by_sector = Fortune_1000.groupby('Sector')['Assets'].sum().reset_index()
sectors = assets_by_sector['Sector']
assets_values = assets_by_sector['Assets']
fig = go.Figure(data=[go.Pie(labels=sectors, values=assets_values, marker=dict(colors=colors), textinfo='none')])
fig.update_layout(title='Total Assets by Sector')
fig.show()
The Financials sector dominates in terms of total assets due to the nature of its businesses: around 66% of the total assets of Fortune 1000 companies are held by Financials companies.
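The share behind the 66% figure is obtained by dividing each sector's assets by the grand total. A short sketch on made-up numbers (chosen so that Financials holds 66% of the toy total):

```python
import pandas as pd

# Toy data: Financials holds 660 of 1000 total assets
sample = pd.DataFrame({
    'Sector': ['Financials', 'Financials', 'Technology'],
    'Assets': [600.0, 60.0, 340.0],
})

assets_by_sector = sample.groupby('Sector')['Assets'].sum()
share = assets_by_sector / assets_by_sector.sum() * 100  # percentage of total assets
print(share.round(1))
```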
roa_by_sector = Fortune_1000.groupby('Sector')['ROA(%)'].mean()
sectors = roa_by_sector.index
roa_values = roa_by_sector.values
# Reuse the colour palette defined above
bar_trace = go.Bar(
x=sectors,
y=roa_values,
marker=dict(color=colors[:len(sectors)])
)
layout = go.Layout(
xaxis=dict(title='Sector'),
yaxis=dict(title='Return on Assets (ROA)'),
title='ROA by Sector',
xaxis_tickangle=-45
)
fig = go.Figure(data=[bar_trace], layout=layout)
fig.show()
However, considering ROA by sector, things change: even though they hold by far the most assets, companies from the Financials sector have a lower ROA than almost every other sector. The Apparel sector (design, manufacturing, and retailing of clothing, footwear, and accessories) leads in terms of efficient use of assets.
correlation_matrix = Fortune_1000[['Assets', 'Revenue', 'Profit']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix: Assets, Revenues, and Profits')
plt.tight_layout()
plt.show()
The correlation matrix shows a positive correlation between every pair of metrics: revenues, assets, and profits. The strongest relationship is the one between revenues and profits (0.65), while there is only a weak correlation between revenues and assets (0.3).
Regarding the number of employees, I want to see which companies have the most employees and the highest revenue per employee.
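Since Pearson's r is sensitive to the heavy right tails seen in these variables, a rank-based Spearman correlation makes a useful robustness check. This is an additional sketch, not part of the original analysis, run on toy data with one dominant outlier:

```python
import pandas as pd
from scipy import stats

# Toy data: one extreme point dominates the Pearson estimate
revenue = pd.Series([10.0, 20.0, 30.0, 40.0, 5000.0])
profit = pd.Series([2.0, 1.0, 4.0, 3.0, 400.0])

pearson_r, _ = stats.pearsonr(revenue, profit)
spearman_r, _ = stats.spearmanr(revenue, profit)
# Pearson is inflated by the outlier; Spearman uses ranks only
print(f'Pearson r = {pearson_r:.2f}, Spearman r = {spearman_r:.2f}')
```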
top_10_employees = Fortune_1000.nlargest(10, 'Employees')
fig = px.bar(top_10_employees, x='Company', y='Employees', color='Company')
fig.update_layout(
xaxis_title='Company',
yaxis_title='Number of Employees',
title='Top 10 Companies by Number of Employees',
xaxis_tickangle=-45
)
fig.show()
Walmart and Amazon have the highest number of employees, but in terms of revenue per employee, other companies dominate with impressive figures.
Fortune_1000['Revenue_per_Employee'] = Fortune_1000['Revenue'] / Fortune_1000['Employees']
top_10_revenue_per_employee = Fortune_1000.nlargest(10, 'Revenue_per_Employee')
fig = px.bar(top_10_revenue_per_employee, x='Company', y='Revenue_per_Employee', color='Company')
fig.update_layout(
xaxis_title='Company',
yaxis_title='Revenue per Employee (millions $)',
title='Top 10 Companies by Revenue per Employee',
xaxis_tickangle=-45
)
fig.show()
Some of these figures are remarkably high: tens of millions of dollars per employee.
To see whether revenues actually depend on the number of employees, I look at the correlation between the two.
revenues = Fortune_1000['Revenue']
employees = Fortune_1000['Employees']
slope, intercept, r_value, p_value, std_err = stats.linregress(employees, revenues)
r_squared = r_value ** 2
fig = go.Figure()
fig.add_trace(go.Scatter(
x=employees,
y=revenues,
mode='markers',
marker=dict(color='blue', opacity=0.5),
name='Data Points'
))
x = np.linspace(0, employees.max(), 100)
y = intercept + slope * x
fig.add_trace(go.Scatter(
x=x,
y=y,
mode='lines',
line=dict(color='red'),
name='Regression Line'
))
fig.update_layout(
xaxis_title='Number of Employees',
yaxis_title='Revenues (millions $)',
title='Correlation between Revenues and Number of Employees',
showlegend=True
)
regression_equation = f'Regression Line: y = {slope:.2f}x + {intercept:.2f}'
correlation_text = f'Correlation (r): {r_value:.2f}'
r_squared_text = f'R-squared: {r_squared:.2f}'
fig.add_annotation(
xref='paper',
yref='paper',
x=0.98,
y=0.9,
text=regression_equation,
showarrow=False,
font=dict(color='white'),
align='right',
bgcolor='black',
bordercolor='black',
borderwidth=2
)
fig.add_annotation(
xref='paper',
yref='paper',
x=0.98,
y=0.85,
text=correlation_text,
showarrow=False,
font=dict(color='white'),
align='right',
bgcolor='black',
bordercolor='black',
borderwidth=2
)
fig.add_annotation(
xref='paper',
yref='paper',
x=0.98,
y=0.8,
text=r_squared_text,
showarrow=False,
font=dict(color='white'),
align='right',
bgcolor='black',
bordercolor='black',
borderwidth=2
)
fig.show()
The plot is stretched by outliers in the number of employees (as seen above, Walmart and Amazon), but a fairly strong positive relationship emerges anyway (r = 0.73). R-squared is 0.54, meaning the linear model fits the data reasonably well. So, generally speaking, the higher the number of employees, the higher the revenues.
To look at the geographical dimension and distribution of Fortune 1000 companies, I want to draw a map of the United States showing the distribution of headquarters.
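Given how much the outliers stretch the plot, an alternative check (again, not part of the original analysis) is to fit the regression in log-log space, which damps the leverage of extreme points. A sketch on toy right-skewed data:

```python
import numpy as np
from scipy import stats

# Toy right-skewed data standing in for Employees and Revenue
employees = np.array([100, 500, 2_000, 10_000, 2_000_000])
revenues = np.array([50, 300, 1_200, 7_000, 500_000])

# Fit in log-log space so extreme points carry far less leverage
res = stats.linregress(np.log10(employees), np.log10(revenues))
print(f'log-log slope = {res.slope:.2f}, r = {res.rvalue:.2f}')
```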
In order to draw the map, I downloaded the shapefile of US State Boundaries from this website: https://hub.arcgis.com/datasets/1612d351695b467eba75fdf82c10884f/explore?location=16.090374%2C0.315549%2C2.18&showTable=true.
Before starting, I create copies of the dataframes to avoid accidentally modifying the originals.
Fortune_1000_copy = cp.deepcopy(Fortune_1000)
Fortune_1000_copy.head()
| Company | Revenue | Assets | Profit | Employees | Sector | State | City | CEO_Founder | CEO_Woman | New_Entry | Profitable | CEO | Website | Profit_Margin(%) | ROA(%) | Revenue_per_Employee | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Walmart | 572754.0 | 244860.0 | 13673.0 | 2300000 | Retailing | AR | Bentonville | no | no | no | yes | C. Douglas McMillon | https://www.stock.walmart.com | 2.387238 | 5.584007 | 0.249023 |
| 1 | Amazon | 469822.0 | 420549.0 | 33364.0 | 1608000 | Retailing | WA | Seattle | no | no | no | yes | Andrew R. Jassy | www.amazon.com | 7.101413 | 7.933439 | 0.292178 |
| 2 | Apple | 365817.0 | 351002.0 | 94680.0 | 154000 | Technology | CA | Cupertino | no | no | no | yes | Timothy D. Cook | www.apple.com | 25.881793 | 26.974205 | 2.375435 |
| 3 | CVS Health | 292111.0 | 232999.0 | 7910.0 | 258000 | Health Care | RI | Woonsocket | no | yes | no | yes | Karen Lynch | https://www.cvshealth.com | 2.707875 | 3.394864 | 1.132213 |
| 4 | UnitedHealth Group | 287597.0 | 212206.0 | 17285.0 | 350000 | Health Care | MN | Minnetonka | no | no | no | yes | Andrew P. Witty | www.unitedhealthgroup.com | 6.010146 | 8.145387 | 0.821706 |
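A side note on the copies: cp.deepcopy works, but pandas DataFrames also provide a built-in copy method, which is the idiomatic way to obtain an independent copy:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['Walmart'], 'Revenue': [572754.0]})
df_copy = df.copy(deep=True)  # deep=True is the default

# Modifying the copy leaves the original untouched
df_copy.loc[0, 'Revenue'] = 0.0
print(df.loc[0, 'Revenue'])
```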
US_merged_copy = cp.deepcopy(US_merged)
US_merged_copy.head()
| State | Abbreviation | GDP($) | Unemployment(%) | Population | |
|---|---|---|---|---|---|
| 0 | Alabama | AL | 254109.7 | 3.450000 | 5049846 |
| 1 | Alaska | AK | 57349.4 | 6.425000 | 734182 |
| 2 | Arizona | AZ | 420026.7 | 4.950000 | 7264877 |
| 3 | Arkansas | AR | 148676.1 | 4.058333 | 3028122 |
| 4 | California | CA | 3373240.7 | 7.366667 | 39142991 |
current_dir = os.getcwd()
shapefile_path = os.path.join(current_dir, 'US_State_Boundaries.shp')
us_map = gpd.read_file(shapefile_path)
# Count companies per state, then align with every state in the shapefile (0 if absent)
state_counts = Fortune_1000_copy['State'].value_counts()
all_states = pd.Series(0, index=us_map['STATE_ABBR'], name='State_Count')
all_states.update(state_counts)
merged_map = us_map.merge(all_states, left_on='STATE_ABBR', right_index=True, how='left')
# White-to-blue colour scale from 0 up to the maximum per-state count
color_map = LinearColormap(['white', 'blue'], vmin=0, vmax=merged_map['State_Count'].max(), caption='State Count')
m = folium.Map(location=[37, -95], zoom_start=4)
folium.GeoJson(merged_map,
name='Choropleth',
style_function=lambda feature: {
'fillColor': color_map(feature['properties']['State_Count']),
'color': 'black',
'weight': 0.8,
'fillOpacity': 0.7
},
highlight_function=lambda x: {'weight': 3},
tooltip=folium.features.GeoJsonTooltip(fields=['NAME', 'State_Count'],
aliases=['State', 'Count'],
sticky=True,
style='background-color: white; color: #333333;'),
show=False
).add_to(m)
color_map.add_to(m)
m